Hybrid prediction of infections and deaths due to COVID-19 in two Colombian data series

The prediction of the number of infected and dead due to COVID-19 has challenged scientists and government bodies, prompting them to formulate public policies to control the virus’ spread and public health emergency worldwide. In this sense, we propose a hybrid method that combines the SIRD mathematical model, whose parameters are estimated via Bayesian inference with a seasonal ARIMA model. Our approach considers that notifications of both, infections and deaths are realizations of a time series process, so that components such as non-stationarity, trend, autocorrelation and/or stochastic seasonal patterns, among others, must be taken into account in the fitting of any mathematical model. The method is applied to data from two Colombian cities, and as hypothesized, the prediction outperforms the obtained with the fit of only the SIRD model. In addition, a simulation study is presented to assess the quality of the estimators of SIRD model in the inverse problem solution.


Introduction
COVID-19 is a highly infectious disease caused by the SARS-CoV-2 virus, first reported on 31 December 2019 in Wuhan, China. In early 2020, the disease crossed borders and due to its severity and the high rates of virus spread, on 11 March 2020, the World Health Organization (WHO) declared it a pandemic. Since then, a large part of the scientific community has been investigating the new coronavirus from different perspectives, seeking to control or prevent high infection and death rates. Particularly, it was necessary to develop and apply mathematical models to provide short-, medium-and long-term forecasts of the infection to enable governments to develop control strategies [1][2][3]. Different mathematical models based on nonlinear systems of ordinary differential equations (ODE), statistical models for time series and computational models, among others, have been used to describe the dynamics of the disease.
The most used mathematical approaches are the compartmental models, where the population is divided into states or sub-populations of Susceptible (S), Infected (I), Latent (L), Recovered (R), Vaccinated (V), Dead (D) and so on. In these models, the change from one state to another occurs through a transition rate and these rates, together with the basic reproductive number (R 0 ) are denominated the model parameters, which can vary in function of the population, geographic location, sociocultural context, climate and control measures, among other factors. This type of models is based on differential equations systems and its solution can be approached from a direct or inverse perspective. From the direct perspective, given a model, a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 initial conditions and a set of parameter values obtained from previous studies or at the discretion of the researcher, the number of individuals at each stage is predicted. This makes the predictions less accurate in comparison with real data. In turn, the inverse approach-using the official epidemiological data-estimates the model parameters through some statistical estimation method (e.g., classical or Bayesian inference) and then uses these values to solve the direct problem, obtaining a more realistic fit to the data observed [4,5].
Solutions of the inverse problem in epidemic models through Bayesian inference have been widely used to study the behaviour of COVID-19. For example, [6] estimated the parameters of the SIR and SEIR models to study the COVID-19 in South Africa; [7] proposed a Bayesian method for the real-time characterization and forecasting of COVID-19 in California and regions of New Mexico; [8] analysed the dynamics of COVID-19 in Germany, by applying a novel combination of epidemiological modelling with specialized neural networks; [9] developed a Bayesian approach based on the probabilistic SIR model using data regarding daily confirmed cases of COVID-19 from six states in the United States; [10] studied the behaviour of COVID-19 in France, performing Bayesian estimation of the initially infected and the parameters of the compartmental SIRD model. A more sophisticated approach considers epidemiological models with time-dependent parameters in order to account for different control strategies, such as remote working from home, closure of schools, lockdowns, etc. For instance, [11] proposed an extended Markov SIR model and developed a R-routine with a calibration procedure that incorporates various types of time-varying quarantines for underreported infected cases in China; [12] proposed a calibration of the SEIR model to noisy data based on the negative binomial generalized additive model (GAM) for determining the timevarying effective contact rate and used this approach to fit the number of daily cases of COVID-19 in Ireland; [13] extended the parsimonious branching process model of the spread of disease and applied Bayesian inference to estimate the instantaneous reproduction number (R t ) to explore the impact of heterogeneity on the distribution of secondary infections by COVID-19 in Ireland. On the other hand, the statistical models most applied for forecasting have been the ARIMA family, which predict future observations of a time series based on the linear dependence-deterministic or stochastic-of past [14][15][16][17][18][19]. In the search to improve both nowcasting and forecasting, different hybrid and ensemble approaches have been explored with accuracy results many times better than others; see for instance [20,21]. However, although many models have been used in the analysis of Covid-19, to the best of our knowledge, there is no methodology that combines both the dynamics between populations and the dependence of past.
Motivated by the behaviour of COVID-19 in Colombia [22,23], we propose a hybrid method to predict the daily values of infected and dead by COVID-19. Initially, the inverse problem of the SIRD model is solved by estimating the parameters via Bayesian inference through the Monte Carlo algorithm based on Markov Chains (MCMC), extending the work in [24]. In the estimation process we assign two prior distributions for each parameter and implement the selection criteria of Bayesian models: log of pseudo-marginal likelihood (LPML) [25] and observed information criterion (DIC3) [26], then we fit a seasonal ARIMA model to the residuals obtained from the discrepancy between the observed data and the best SIRD model selected.
The rest of the article is organized as follows. Section 2 presents the materials and methods, describes the data of the application, the hybrid and SIRD models, Bayesian estimation of its parameters and the comparison criteria of the Bayesian models. It also presents a simulation study that evaluates the statistical properties of the Bayesian estimators. Section 3 applies the method proposed to data from the cities of Calarcá and Pasto in Colombia and, finally, Section 4 presents the conclusions and suggestions for future research.

The data
The data used in this work are the daily cases of people infected and dead by COVID-19 reported by Colombia's National Health Institute (https://www.ins.gov.co/Paginas/Boletinescasos-COVID-19-Colombia.aspx) for the cities of Calarcá and Pasto. Specifically, information was used from 21 March (date when the first case was reported) to 28 July 2020 for Calarcá and from 27 March (date in which four cases had been reported) to 31 July 2020 for Pasto. Information on the total number of inhabitants (N) for both cities was obtained from the website of Colombia's National Administrative Department of Statistics, (DANE), which for Calarcá is N = 74890 and for Pasto is N = 392589. In both cases, it was assumed that the number of recovered and of deaths at the start of the pandemic was zero.

The hybrid model
Let {y j , j = 1, . . ., n} be a realization of the stochastic process fY t ; t 2 t � Rg and assume that the time series can be decomposed as where the term f(X θ (t j )) is called the signal and {� j , j = 1, 2, . . ., n} are the errors. In many situations it is assumed that errors are independent and identically distributed Nð0; s 2 � Þ. In this case, if f(X θ (t)) is a deterministic function then (1) is a regression model and observations are non-correlated [27]. The component f(X θ (t)) describes the trend of the observations, which can be linear, polynomial, exponential, logistic or harmonic polynomial, and so on. A special case occurs when X θ (t) is the solution of a mathematical model of the form and f is the identity, in this case the solution of (2) corresponds to the regressor of (1). Prior estimation of vector of parameters θ from model (2) is known as the inverse problem. On the other hand, as we said before, two common assumptions are the independence and normality of the random errors. However, autocorrelation, non-stationarity or stochastic seasonality can remain in the residuals. One way to correct the model is by fitting an appropriate model to the series {� j , j = 1, 2, . . ., n}. In modelling the COVID-19 data, ARIMA and seasonal ARIMA (SARIMA) have been widely used due to their satisfactory forecasts. The SARIMA(p, d, q) × (P, D, Q) s model for time series {� j , j = 1, 2, . . ., n} is given by and the hybrid model is given by

The deterministic model
Accordingly to (2) and for data before vaccination, in this work we employ the classical SIRD model [2] where the population N(t) is further divided into: individuals who are susceptible S (t), infected I(t), recovered R(t) and dead D(t). Moreover, it is assumed that the population is homogeneous and remains constant over a given infectious period, i.e., is given by the following nonlinear system of differential equations: where I(t 0 ) = I 0 > 0, D(t 0 ) = D 0 � 0, and R(t 0 ) = R e � 0 are the initial values of infected, dead and recovered individuals, respectively, β is the infection rate, ϕ is the recovery rate, γ is the mortality rate. t ¼ 1 gþ� denotes the average time of the infectious period and R 0 = β/(ϕ + γ) is the basic reproductive number. It is well known that if R 0 > 1, the epidemic propagates and if R 0 < 1, the epidemic subsides. It is important to note that the SIRD model is typically written in terms of the rates β, γ and ϕ. However we reparameterized it in terms of R 0 , γ and ϕ observing that β can be obtained through ϕ, γ and R 0 . This reparameterization was motivated by knowledge of the behavior of R 0 for COVID-19.

Inverse problem solution
2.4.1 Statistical model. Let z = [y; d] n×2 be a matrix whose columns y = (y 1 , . . ., y n ) > and d = (d 1 , . . ., d n ) > are two independent random vectors of n observations that follow the dynamics of the SIRD model with unknown parameters, where y j and d j represents, respectively, the number of infections and deaths reported between day j − 1 and j, with j = 1, . . ., n. This period of time is denoted by the interval (t j−1 , t j ]. Due to the nature of y j and d j , their probabilistic behaviour is modelled through independent Poisson distributions, that is, y j �

Likelihood function.
Let θ = (R 0 , ϕ, γ) > be the vector of parameters in the SIRD model, the likelihood function of θ given the vector of observations z, is given by Our solution of the inverse problem seeks to first estimate the vector of parameters θ of the model (5) from the number of infections and deaths reported in a population during a period of time and then solve the SIRD model using those values. Specifically, we propose to perform the estimation process of θ from a Bayesian approach through MCMC algorithms.

Bayesian inference. 2.4.3.1 Prior and posterior distributions.
We define non-informative priors for ϕ and γ and, for R 0 we use two proposals: i) a prior distribution that takes into account the information available about this parameter from the Colombian National Health Institute and ii) an exponential distribution. Specifically, we allocate ϕ * U(0, 1), γ|ϕ * U(0, 1 − ϕ) because the average duration of the infectious period (τ) must be greater than one day, which implies the relation ϕ Therefore, the joint prior distribution of vector θ can be obtained through π(θ) = π(R 0 ) × π (ϕ) × π(γ|ϕ), which can be rewritten under i) as π 1 The estimation procedure consists of simulating values of the posterior distribution of θ, which is obtained by combining the prior distribution π i (θ), i = 1, 2, with the likelihood function (7); that is pðθ j zÞ / pðθÞ � Lðθ j zÞ: However, due to the complexity of this process, caused by the dependence of this distribution on the solution of a system of ODE, we implement the t-walk algorithm [28] through the R package Rtwalk, since it is particularly well suited for generating samples from a posteriori distributions using nonstandard models.
This algorithm generates samples from continuous distributions starting at two independent random points in the sample space from there each move is generated from one of four proposed distributions (see the options in [28]) and accepted with a probability as in Metropolis-Hastings on the product space, which implies that two chains are not generated independently of each other. The algorithm can be summarized as follows: i) two points are randomly and independently selected in the sample space; ii) then with probability of 0.5 the value that will be updated is selected; iii)subsequently, the proposed distribution is randomly selected (among four possible ones) and a new value is generated while the other remains unchanged and iv) this value will be accepted through calculation of the acceptance probability of the Metropolis-Hastings algorithm.
Finally, the implemented algorithm can be summarized as follows: 2). Randomly and independently generate two initial values for each of the parameters of the SIRD model: R 0 , ϕ and γ.

4). Once the integrals have been determinated numerically in step 3)
, values of the a posteriori distribution π(θ|z) are generated. 5). Return to step 2) and repeat the process until convergence is obtained.
Upon obtaining the results from two chains, 250000 samples, called burn-in, are discarded, and 50000 more samples (with spacing of 300) are used to calculate posterior summaries.
Convergence was monitored through the statistical diagnostic potential scale reduction factor (psrf) proposed by Gelman and Rubin [29]. The psrf denoted by b R allows monitoring the convergence to the target distribution (posterior distribution in this case) of two randomly initialized chains calculating the ratio between the average of the variances obtained in each chain and the variance of the grouped chains; if the chains have reached the target distribution, this ratio will be close to 1 (in this work b R < 1:2 was used), otherwise b R will be far from 1.

Bayesian model selection.
To select the model that best fits the COVID-19 data in each city analysed, the Bayesian criteria for model selection called log pseudo-marginal likelihood (LPML) [30] and a modified version of the DIC, named DIC 3 [26], were implemented.
The LPML summarizes the conditional predictive ordinate (CPO), obtained from the posterior predictive distribution. For the j-th observation the CPO, denoted by CPO j , is expressed as is the probability mass function of z j obtained as the product between Poissonðy * j ðθÞÞ and Poissonðd * j ðθÞÞ, D denotes the full dataset, D ðÀ jÞ the data without the j−th observation and π(θ|�) is the posterior distribution from θ. Because of the lack of a closed form for this integral, the CPO j is estimated through a harmonic-mean approximation as proposed by [31] , where θ 1 , . . ., θ Q is a post burn-in sample of size Q from pðθjDÞ. Finally the LPML is calculated as The best fit is given by the largest LPML.
Although other criteria can be used, in this work we implemented an alternative to DIC [32], named DIC 3 [26], due to the complexity of our model since the parameters of the Poisson distributions are obtained from the solution of a system of ODE. This criterion is defined as
For each parameter in θ, two Markov chains of 30000 size are generated and their convergence is evaluated through the psrf ( b R < 1:2), as mentioned previously. After discarding a burn-in of 15000 values and applying a lag of 20, the posterior mean is estimated. We also estimate the relative bias (RB) of b y r , r = 1, 2, 3 given by RB ¼ 1 100 P 100 l¼1 ð b y r l =y r À 1Þ and the mean square error, MSE ¼ 1 100 P 100 l¼1 ð b y r l À y r Þ 2 , where b y r l is the estimator of θ r obtained with the l-th simulated sample, z l . Tables 1 and 2 summarize the results for n = 30 and n = 60, respectively. Notice that (i) the MSE decreases with the sample size increases; (ii) both RB and MSE are reasonably small, suggesting a good precision and (iii) the estimator of R 0 has the best accuracy and precision. So, based on these simulation results, it can be concluded that Bayesian estimators of the SIRD

Results and discussion
This section reports the application of the hybrid method proposed to predict the daily notifications of infected and dead due to COVID-19 in two Colombian cities: Calarcá and Pasto.
Initially, the SIRD model is estimated and fitted to the data, then a SARIMA(p, d, q) × (P, D, Q) s is fitted to the residuals fb � j g to calculate the predictions of the hybrid model (4).

Covid-19 predictions for Calarcá-Colombia
For Calarcá we considered the following initial conditions: population size N = 74890, since the first infection was notified on 21 March 2020, so I(0) = 1, S(0) = 74889, R(0) = 0 and D(0) = 0. The estimation of the SIRD model parameters was based on two joint prior distributions: π 1 (θ) = U(1, 3) × U(0, 1) × U(0, 1) and π 2 (θ) = exp(1) × U(0, 1) × U(0, 1). In joint distribution π 1 (θ), the uniform prior distribution U(1, 3) for R 0 was chosen based on the fact that the Colombia's National Health Institute established that at the beginning of the pandemic 1 < R 0 < 3 [33]. In the joint distribution π 2 (θ), an exponential distribution with parameter 1 was chosen to make sure the parameter was positive. For each parameter vector we generated two Markov chains of size 100000 from the posterior distribution π i (θ j z), i = 1, 2, and convergence of chains was monitored using the psrf ( b R < 1:2), as mentioned previously. Then, a burn-in of 50000 and a lag of 20 were applied on the chains to calculate the posterior mean. In order to select the best posterior mean we calculated the LPML and DIC3 selection criteria for posterior distributions obtained with π 1 (θ) and π 2 (θ) priors, and values are given in Table 3. Notice that there is no significant difference between the two models, so we can use either of them. In this case we prefer π 1 (θ j z). Table 4 shows the posterior means, standard deviations, and 95% CI based on quantiles 0.025 and 0.975 of SIRD model parameters for Calarcá data. We can conclude that at the beginning of pandemic i) on average, one infected person could infect approximately 1.1 susceptible people during the infectious period; ii) for every 1000 individuals infected, 126 recovered per day, 2.8 died due to the disease and the others continued infected; iii) the average time of duration of the infectious period, τ, is approximately eight days; iv) with 0.95 of probability: a) the number of individuals infected by an infected person varied between 1.096 and 1.122; b) for each 1000 infected individuals, between 113 and 135 people recovered daily and between 0.08 and 9.2 individuals died daily due to COVID-19. Fig 1(a) and 1(b) show respectively, the cumulative number of infections and deaths (blue squares) that were notified by the [34] between 20 March and 28 July 2020 (training set) with the predictions (black curves) from the respective estimated SIRD model.

Covid-19 predictions for Pasto-Colombia
In this case the population size is N = 392589 and since there were four initially notified cases on 27 March 2020, the initial conditions are I 0 = 4, S(0) = 392585, I(0) = 4, R(0) = 0 and D(0) = 0. We used the same priors as above and calculated the LPML and DIC3 criteria to select the best posterior distribution. According to the results in Table 5 the posterior π 1 (θ j z) is selected as the best model because it has the highest LPML and lowest DIC3. Table 6 shows the Bayesian estimations, accordingly to which we can conclude that at the beginning of the pandemic i) on average, for each 1000 infected individuals daily 71 recovered, three died, and the others continued infected; ii) the average duration of the infectious period, τ, is 13 days; iii) on average, an infected person can infect 1.6 susceptible people during the infectious period; iv) with a probability of 0.95, a) the number of susceptible individuals that can be infected by an infected person varies between 1.5 and 1.9; b) for each 1000 individuals infected, between 53 and 92 recovered daily, and the others died due to the infection or remained infected; c) for every 1000 infected individuals with COVID-19, between 2.6 and 3.8 died, the other individuals recovered or remained infected. In all cases, the RMSE is much lower in the hybrid approach. It is important to highlight that the average times of transmission from an infected person, estimated as a function of b � and b g in the SIRD model in the cities of Calarcá and Pasto, are close to those reported in prior studies. For instance, [35] found that for 30 provinces in China and 15 cities in the province of Hubei, this time varied from 7 to 14 days, and according to the Center for Coordination of Sanitary Alerts and Emergencies [36], it was determined that the COVID-19 infectious period was between 9 and 17 days. These results agree with our estimations of τ for Pasto and Calarcá of 8 and 13 days, respectively. Moreover, our estimations of R 0 for both cities are also close to those reported by [37], that in some European countries 1.4 < R 0 < 6.49 with a mean value of 3.28, while in Colombia, the National Health Institute reported an R 0 < 3 indicating that our estimations of 1.107 and 1.659 for Calarcá and Pasto, respectively, are reasonable from the epidemiological point of view.

Conclusions
The great challenge that the COVID-19 pandemic has caused in the search for models that provide good predictions of the number of infected, dead and recovered, in the short-and medium-term is well known. These studies have enabled government entities to develop actions that minimize the viral contagion speed and to reduce the economic and social impacts on the population.
Motivated by this challenge, we propose a hybrid method based on the combination of the SIRD compartmental model and a SARIMA model. The SIRD model captures the series trend and takes into account the interactions among the distinct states of the population, while the SARIMA model incorporates stochastic components such as non-stationarity, autocorrelation and seasonality that are remaining in the residuals. We use our proposal to predict the behaviour of the series of infections and deaths due to COVID-19 in two Colombian cities, obtaining better performance, in the sense that it significantly diminished the values of RMSE, compared with the fit of only the SIRD model. In spite of being used with these data, the hybrid method described in this work can be applied to predict other infectious diseases.
Future studies that can be derived from this work include: (i) modelling both the number of infections and number of deaths through the negative binomial distribution and comparing the results obtained with the Poisson distribution used in this work; (ii) evaluating the effect of climatic, social and economic variables, for instance, on the parameters of the SIRD model; and (iii) extending the method proposed to other compartmental models that permit incorporating information about control measures, like those taken by different governments to reduce virus propagation.